AITopics | cross-modal contrastive learning

cross-modal contrastive learning

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Neural Information Processing Systems | Dec-24-2025, 13:47:22 GMT

In natural language processing, most models learn semantic representations from text alone. The learned representations encode "distributional semantics" but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes "grounded semantics" for cognition. Inspired by this notion and by recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream, which merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations using the MS COCO dataset. Using the Visual Genome dataset, the model then learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator. After training, the model's language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
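
The alignment step described in this abstract can be illustrated with a symmetric InfoNCE-style loss, a common formulation of cross-modal contrastive learning. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the temperature value, and the toy 2-D embeddings are all hypothetical, and the abstract does not say which contrastive objective the authors use.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    # Assumption: row k of image_embs and row k of text_embs form a matched pair;
    # every other row in the batch serves as a negative.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each image should pick out its own caption.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy embeddings (hypothetical): matched pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_info_nce(images, [[0.1, 0.9], [0.9, 0.1]])
```

With these toy inputs the loss is lower when each caption embedding sits close to its own image than when the captions are swapped, which is the behavior the training objective rewards.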

cross-modal contrastive learning, explainable semantic space, grounding language, (9 more...)

Neural Information Processing Systems

Industry: Education > Curriculum > Subject-Specific Education (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Neural Information Processing Systems | Jan-18-2025, 00:24:32 GMT

cross-modal contrastive learning, explainable semantic space, grounding language, (5 more...)

Neural Information Processing Systems

Industry: Education > Curriculum > Subject-Specific Education (0.52)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Multi-Scale and Multi-Modal Contrastive Learning Network for Biomedical Time Series

Guo, Hongbo, Xu, Xinzi, Wu, Hao, Wang, Guoxing

arXiv.org Artificial Intelligence | Dec-6-2023

Multi-modal biomedical time series (MBTS) data offers a holistic view of physiological state and is important in many biomedical applications. Owing to inherent noise and distribution gaps across modalities, MBTS can be complex to model. Various deep learning models have been developed to learn representations of MBTS, but they still fall short in robustness because they ignore modal-to-modal variations. This paper presents a multi-scale and multi-modal biomedical time series representation learning (MBSL) network that uses contrastive learning to mitigate these variations. First, MBTS is grouped based on inter-modal distances, so that each group, having minimal intra-modal variation, can be modeled effectively by an individual encoder. Second, to enhance multi-scale feature extraction in the encoder, various patch lengths and mask ratios are used to generate tokens with semantic information at different scales and from diverse contextual perspectives, respectively. Finally, cross-modal contrastive learning is proposed to maximize consistency among inter-modal groups, preserving useful information and eliminating noise. Experiments on four biomedical applications show that MBSL outperforms state-of-the-art models by 33.9% in mean absolute error (MAE) for respiration rate, by 13.8% in MAE for exercise heart rate, by 1.41% in accuracy for human activity recognition, and by 1.14% in F1-score for obstructive sleep apnea-hypopnea syndrome.
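
The idea of pairing patch lengths with mask ratios can be sketched as follows. This is an illustration under stated assumptions rather than the MBSL code: the helper names, the specific (patch length, mask ratio) pairs, and the choice to mask by zeroing a patch are hypothetical, since the abstract does not specify how masked tokens are represented.

```python
import random

def patchify(series, patch_len):
    # Split a 1-D series into non-overlapping patches, dropping any ragged tail.
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]

def mask_patches(patches, mask_ratio, rng):
    # Assumption: masking is done by zeroing whole patches chosen at random.
    k = int(round(mask_ratio * len(patches)))
    masked_idx = set(rng.sample(range(len(patches)), k))
    masked = [[0.0] * len(p) if i in masked_idx else list(p)
              for i, p in enumerate(patches)]
    return masked, sorted(masked_idx)

def multi_scale_views(series, scales, seed=0):
    # Build one tokenized view per (patch_len, mask_ratio) pair.
    rng = random.Random(seed)
    views = {}
    for patch_len, mask_ratio in scales:
        views[patch_len] = mask_patches(patchify(series, patch_len), mask_ratio, rng)
    return views

# Toy signal of 24 nonzero samples; the scale settings are hypothetical.
signal = [float(t) for t in range(1, 25)]
views = multi_scale_views(signal, scales=[(4, 0.5), (8, 1 / 3)])
```

Short patches give many fine-grained tokens and long patches give fewer, coarser ones; each view would then be fed to an encoder, which is outside the scope of this sketch.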

dataset, modality, representation, (16 more...)

arXiv.org Artificial Intelligence

2312.03796

Country:

Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.70)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Cross-modal Contrastive Learning for Multimodal Fake News Detection

Wang, Longzheng, Zhang, Chuang, Xu, Hongbo, Xu, Yongxiu, Xu, Xiaohan, Wang, Siqi

arXiv.org Artificial Intelligence | Aug-11-2023

Automatic detection of multimodal fake news has recently gained widespread attention. Many existing approaches seek to fuse unimodal features to produce multimodal news representations. However, the potential of powerful cross-modal contrastive learning methods for fake news detection has not been well exploited. Moreover, how to aggregate features from different modalities to improve the decision-making process remains an open question. To address these issues, we propose COOLANT, a cross-modal contrastive learning framework for multimodal fake news detection that aims to achieve more accurate image-text alignment. To further improve the alignment precision, we leverage an auxiliary task to soften the loss term of negative samples during the contrast process. A cross-modal fusion module is developed to learn the cross-modality correlations. An attention mechanism with an attention guidance module is implemented to aggregate the aligned unimodal representations and the cross-modality correlations effectively and interpretably. Finally, we evaluate COOLANT and conduct a comparative study on two widely used datasets, Twitter and Weibo. The experimental results demonstrate that COOLANT outperforms previous approaches by a large margin and achieves new state-of-the-art results on both datasets.
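
One generic way to soften the penalty on negative samples is to replace the one-hot contrastive target with a blend of the one-hot vector and a similarity-based distribution. The sketch below shows that idea only; it is an assumption, not COOLANT's auxiliary task, and the names `soft_targets`, `soft_cross_entropy`, `alpha`, and the toy scores are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_targets(aux_similarities, anchor_index, alpha=0.2):
    # Assumption: an auxiliary signal scores how semantically close each
    # candidate is to the anchor; alpha controls how much it softens the target.
    aux = softmax(aux_similarities)
    return [(1 - alpha) * (1.0 if j == anchor_index else 0.0) + alpha * aux[j]
            for j in range(len(aux))]

def soft_cross_entropy(logits, targets):
    # Cross-entropy between a target distribution and softmax(logits).
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

# Toy example: candidate 1 is a near-duplicate of the anchor's true match,
# candidate 2 is unrelated (all scores are hypothetical).
logits = [5.0, 4.5, -2.0]
targets = soft_targets([3.0, 2.8, -1.0], anchor_index=0, alpha=0.4)
loss = soft_cross_entropy(logits, targets)
```

Under this scheme a negative that closely resembles the positive receives a nonzero target weight, so the model is penalized less for scoring it highly than for scoring an unrelated negative highly.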

detection, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3581783.3613850

2302.14057

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.05)
Asia > China (0.05)
North America > United States > Arkansas (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Industry: Media > News (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Rethinking Audio-visual Synchronization for Active Speaker Detection

Wuerkaixi, Abudukelimu, Zhang, You, Duan, Zhiyao, Zhang, Changshui

arXiv.org Artificial Intelligence | Jul-10-2022

Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers, if any, are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of an active speaker. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification is motivated by our extensive experiments, through which we discover that existing ASD methods fail to model audio-visual synchronization and often classify unsynchronized videos as actively speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules so that supervised ASD models can leverage the synchronization cue. Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
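
A simple way to see how synchronization can act as a contrastive signal is to treat a time-aligned audio-visual pair as a positive and a time-shifted copy of the audio as a negative. The sketch below illustrates that construction only; it is an assumption, since the abstract does not describe how the authors build their pairs, and the helper names and toy activity traces are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length sequences (0.0 if either is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def shift(seq, offset):
    # Circularly shift a per-frame feature sequence by `offset` frames.
    offset %= len(seq)
    return seq[-offset:] + seq[:-offset] if offset else list(seq)

def sync_pairs(audio, video, offset):
    # Positive: aligned audio and video. Negative: audio delayed by `offset` frames.
    return (audio, video), (shift(audio, offset), video)

# Toy per-frame activity: 1.0 while "speaking", 0.0 while silent (hypothetical data).
video = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
audio = list(video)  # a perfectly synchronized track
(pos_a, pos_v), (neg_a, neg_v) = sync_pairs(audio, video, offset=3)
pos_score = cosine(pos_a, pos_v)
neg_score = cosine(neg_a, neg_v)
```

In this toy case the aligned pair has a much higher similarity than the shifted pair, and a contrastive objective would train the audio and visual encoders to preserve that gap on real features.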

artificial intelligence, machine learning, synchronization, (17 more...)

arXiv.org Artificial Intelligence

2206.10421

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York > Monroe County > Rochester (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
